Staged Multi-armed Bandits
Authors
Abstract
In conventional multi-armed bandits (MAB) and other reinforcement learning methods, the learner sequentially chooses actions and obtains a reward (which may be missing, delayed, or erroneous) after each taken action. This reward is then used by the learner to improve its future decisions. However, in numerous applications, ranging from personalized patient treatment to personalized web-based education, the learner does not obtain a reward after each action; instead, a reward is obtained only after a sequence of actions is taken, intermediate feedback is observed, and a final decision is made. In this paper, we introduce a new class of reinforcement learning methods that can operate in such settings. We refer to this class as staged multi-armed bandits (S-MAB). An S-MAB proceeds in rounds, each composed of several stages: in each stage, the learner chooses an action and observes a feedback signal, while the reward of the selected sequence of actions is revealed only after the learner selects a stop action that ends the current round. The reward of a round depends on both the sequence of actions and the sequence of observed feedbacks. The goal of the learner is to maximize its total expected reward over all rounds by learning to choose the best sequence of actions based on the feedback it receives about these actions. First, we define an oracle benchmark, which sequentially selects the actions that maximize the expected immediate reward. This benchmark is known to be approximately optimal when the reward sequence associated with the selected actions is adaptive submodular. Then, we propose an online learning algorithm and prove that its regret with respect to the oracle benchmark is logarithmic in the number of rounds and linear in the number of stages. We illustrate the performance of S-MAB using a personalized web-based education system aimed at providing remedial assistance for students in a Digital Signal Processing class.
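The abstract describes the interaction protocol only in prose, so the following is a minimal illustrative sketch of it in Python, not the algorithm proposed in the paper: rounds made of stages, an intermediate feedback signal after every non-stop action, and a round reward revealed only once the stop action is played. The environment interface (env.feedback, env.reward), the STOP label, and the per-stage UCB-style index are assumptions made purely for illustration.

import math

STOP = "stop"  # hypothetical label for the action that ends a round

def run_smab(env, actions, num_rounds, max_stages):
    # env is assumed to expose:
    #   env.feedback(history, action) -> intermediate feedback signal
    #   env.reward(history)           -> reward of the whole round, revealed at stop
    arms = list(actions) + [STOP]
    counts = {(s, a): 0 for s in range(max_stages) for a in arms}
    means = {(s, a): 0.0 for s in range(max_stages) for a in arms}
    total_reward = 0.0

    for t in range(1, num_rounds + 1):
        played = []   # (stage, action) decisions made in this round
        history = []  # (action, feedback) pairs observed in this round
        for stage in range(max_stages):
            # Optimistic (UCB-style) index per (stage, action) pair -- illustrative only.
            def ucb(a):
                n = counts[(stage, a)]
                if n == 0:
                    return float("inf")
                return means[(stage, a)] + math.sqrt(2.0 * math.log(t) / n)

            action = max(arms, key=ucb)
            played.append((stage, action))
            if action == STOP:
                break  # the stop action ends the round
            feedback = env.feedback(history, action)  # only feedback, no reward yet
            history.append((action, feedback))

        reward = env.reward(history)  # reward of the round, revealed only now
        total_reward += reward
        for (s, a) in played:  # credit the round reward to every decision made
            counts[(s, a)] += 1
            means[(s, a)] += (reward - means[(s, a)]) / counts[(s, a)]

    return total_reward

Note that this sketch chooses the next action from per-stage statistics alone and ignores the feedback observed earlier in the round; the paper's algorithm conditions its choices on that feedback, which this illustration does not attempt to reproduce.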
Similar resources
Modal Bandits
Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.
Habilitation À Diriger Des Recherches De L'université Paris-est Agrégation Pac-bayésienne Et Bandits À Plusieurs Bras Pac-bayesian Aggregation and Multi-armed Bandits
Generic Exploration and K-armed Voting Bandits
We study a stochastic online learning scheme with partial feedback where the utility of decisions is only observable through an estimation of the environment parameters. We propose a generic pure-exploration algorithm, able to cope with various utility functions from multi-armed bandit settings to dueling bandits. The primary application of this setting is to offer a natural generalization of ...
The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits
We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^{2/3} S^{1/3}) or better (sometimes, much better). Here S is th...
Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, supp...
Journal:
- CoRR
Volume: abs/1508.00641
Pages: -
Publication date: 2015